Performance optimization strategy of distributed storage for industrial time series big data based on HBase
Li YANG, Jianting CHEN, Yang XIANG
Journal of Computer Applications    2023, 43 (3): 759-766.   DOI: 10.11772/j.issn.1001-9081.2022020211

In automated industrial scenarios, the volume of time series log data generated by large numbers of industrial devices has exploded, and business demand for access to these time series data keeps growing. Although HBase, a distributed column-family database, can store industrial time series big data, existing strategies cannot satisfy the specific access requirements of industrial time series data well, because they ignore the correlation between the data and the access behavior characteristics of specific business scenarios. To address this problem, a distributed storage performance optimization strategy for massive industrial time series data was proposed on the basis of the distributed storage system HBase, exploiting the correlation between data and access behavior characteristics in industrial scenarios. First, aiming at the load tilt caused by the characteristics of industrial time series data, a load balancing optimization strategy based on hot/cold data partitioning and access behavior classification was proposed: the data were classified into hot and cold ones by a Logistic Regression (LR) model, and the hot data were distributed across different nodes. Second, to further reduce cross-node communication overhead in the storage cluster and improve query efficiency for the high-dimensional index of industrial time series data, a strategy of placing the index and its main data in the same Region was proposed: by designing the index RowKey field and splicing rules, each index entry was stored in the same Region as its corresponding main data. Experimental results on real industrial time series data show that with the proposed optimization strategy, the tilt degree of the data load distribution is reduced by 28.5% and query efficiency is improved by 27.7%, demonstrating that the strategy can effectively mine access patterns of specific time series data, distribute load reasonably, reduce data access overhead, and meet the access requirements of specific time series big data.
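Below is a minimal, hypothetical sketch of the Region co-location idea: the index RowKey is spliced so that it shares its salt-and-device prefix with the main data RowKey, so HBase's range partitioning keeps index and main data in the same Region. All field names and splicing rules here are illustrative assumptions, not the paper's actual design.

```python
import zlib

def salt(device_id: str, buckets: int = 16) -> str:
    """Deterministic salt bucket; spreading devices over buckets balances load."""
    return f"{zlib.crc32(device_id.encode()) % buckets:02d}"

def main_rowkey(device_id: str, ts_ms: int) -> str:
    # main data row: <salt>|<device>|<timestamp>
    return f"{salt(device_id)}|{device_id}|{ts_ms:013d}"

def index_rowkey(device_id: str, metric: str, value: float, ts_ms: int) -> str:
    # The index row reuses the salt+device prefix of the main row, so both fall
    # into the same RowKey range, hence the same Region: resolving an index hit
    # to its main record then needs no cross-node lookup.
    return f"{salt(device_id)}|{device_id}|idx|{metric}|{value:012.3f}|{ts_ms:013d}"

print(main_rowkey("dev42", 1650000000000))
print(index_rowkey("dev42", "temperature", 87.5, 1650000000000))
```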

Text semantic de-duplication algorithm based on keyword graph representation
Jinyun WANG, Yang XIANG
Journal of Computer Applications    2023, 43 (10): 3070-3076.   DOI: 10.11772/j.issn.1001-9081.2022101495

Networks contain large numbers of redundant texts with the same or similar semantics. Text de-duplication addresses the storage space wasted by redundant texts and reduces unnecessary consumption in information extraction tasks. Traditional text de-duplication algorithms rely on literal overlap and make no use of the semantic information of texts; moreover, they cannot capture interactions between sentences that are far apart in a long text, so their de-duplication effect is not ideal. Aiming at the problem of text semantic de-duplication, a long-text de-duplication algorithm based on keyword graph representation was proposed. Firstly, each text pair was represented as a graph whose vertices are the semantic keyword phrases extracted from the pair. Secondly, the nodes were encoded in several ways, and a Graph Attention Network (GAT) was used to learn the relationships between nodes, yielding a vector representation of the text pair's graph from which the semantic similarity of the two texts was judged. Finally, de-duplication was performed according to the semantic similarity of the text pair. Compared with traditional methods, this method exploits the semantic information of texts effectively and, through the graph structure, connects distant sentences of a long text via the co-occurrence of keyword phrases, increasing the semantic interaction between different sentences. Experimental results show that the proposed algorithm outperforms traditional algorithms such as Simhash, BERT (Bidirectional Encoder Representations from Transformers) fine-tuning, and Concept Interaction Graph (CIG) on both the CNSE (Chinese News Same Event) and CNSS (Chinese News Same Story) datasets, with F1 scores of 84.65% on CNSE and 90.76% on CNSS, indicating that the proposed algorithm improves text de-duplication effectively.
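As a rough illustration of the graph construction step, the sketch below builds a keyword co-occurrence graph for a text pair and encodes it with a GAT layer. The keyword extractor, node features, and all hyperparameters are toy assumptions; the paper's actual keyphrase extraction and node encodings are richer.

```python
import itertools
import torch
from torch_geometric.nn import GATConv

def build_keyword_graph(sentences_a, sentences_b, keywords):
    """Vertices are keyword phrases; edges link phrases co-occurring in a
    sentence, which connects sentences that are far apart in a long text."""
    nodes, edges = {}, set()
    for sent in itertools.chain(sentences_a, sentences_b):
        ids = {nodes.setdefault(k, len(nodes)) for k in keywords(sent)}
        edges.update(itertools.combinations(sorted(ids), 2))
    return nodes, edges

kw = lambda s: [w.lower() for w in s.split() if len(w) > 4]  # toy extractor
nodes, edges = build_keyword_graph(
    ["The storm damaged coastal villages"],
    ["Coastal villages were damaged badly"], kw)

x = torch.randn(len(nodes), 16)                         # toy node features
ei = torch.tensor(sorted(edges), dtype=torch.long).t()
ei = torch.cat([ei, ei.flip(0)], dim=1)                 # make edges undirected
h = GATConv(16, 8, heads=2)(x, ei)                      # attention over keyword neighbours
pair_vec = h.mean(dim=0)                                # pooled graph representation
```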

Chinese event detection based on data augmentation and weakly supervised adversarial training
Ping LUO, Ling DING, Xue YANG, Yang XIANG
Journal of Computer Applications    2022, 42 (10): 2990-2995.   DOI: 10.11772/j.issn.1001-9081.2021081521

Existing event detection models rely heavily on human-annotated data, and supervised deep learning models for event detection often overfit when only limited labeled data are available, while methods that replace time-consuming human annotation with auto-labeled data typically rely on sophisticated pre-defined rules. To address these issues, a BERT (Bidirectional Encoder Representations from Transformers) based Mix-text ADversarial training (BMAD) method for Chinese event detection was proposed. The method sets up a weakly supervised learning scenario on the basis of data augmentation and adversarial learning, and uses a span extraction model to solve the event detection task. Firstly, to relieve the problem of insufficient data, data augmentation methods such as back-translation and Mix-Text were applied to augment the data and create the weakly supervised learning scenario for event detection. Then, an adversarial training mechanism was applied to learn with noise and improve the robustness of the whole model. Experiments were conducted on the widely used real-world dataset Automatic Content Extraction (ACE) 2005. The results show that, compared with algorithms such as Nugget Proposal Network (NPN), Trigger-aware Lattice Neural Network (TLNN) and Hybrid-Character-Based Neural Network (HCBNN), the proposed method improves the F1 score by at least 0.84 percentage points.
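The adversarial-training component can be pictured with an FGM-style perturbation of the embedding matrix, shown below as a hedged sketch; BMAD's actual mechanism may differ, and `model.embedding` is an assumed attribute.

```python
import torch

def fgm_step(model, inputs, labels, loss_fn, optimizer, epsilon=1.0):
    """One step of FGM-style adversarial training: perturb the word-embedding
    weights along the loss gradient, accumulate the adversarial gradients,
    then restore the weights. `model.embedding` is assumed to be nn.Embedding."""
    optimizer.zero_grad()
    loss_fn(model(inputs), labels).backward()             # gradients of the clean loss
    emb = model.embedding.weight
    delta = epsilon * emb.grad / (emb.grad.norm() + 1e-12)
    emb.data += delta                                     # worst-case embedding noise
    loss_fn(model(inputs), labels).backward()             # add adversarial gradients
    emb.data -= delta                                     # restore the weights
    optimizer.step()                                      # update on clean + adversarial grads
```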

Collaborative filtering and recommendation algorithm based on matrix factorization and user nearest neighbor model
YANG Yang, XIANG Yang, XIONG Lei
Journal of Computer Applications    2012, 32 (02): 395-398.   DOI: 10.3724/SP.J.1087.2012.00395
To address the data sparsity and new-user problems common to many collaborative recommendation algorithms, a new collaborative recommendation algorithm based on matrix factorization and user nearest neighbors was proposed. To guarantee prediction accuracy for new users, a user nearest-neighbor model based on user data and profile information was used. Meanwhile, because large datasets and matrix sparsity significantly increase time and space complexity, matrix factorization was introduced to alleviate these data problems and improve prediction accuracy. Experimental results show that the new algorithm improves recommendation accuracy effectively and alleviates both the data sparsity and the new-user problems.
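The combination can be illustrated with the sketch below: matrix factorization fitted by SGD on observed ratings, plus a profile-based nearest-neighbor fallback for users who have not rated anything yet. Shapes, hyperparameters, and the similarity source are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def train_mf(R, k=8, lr=0.01, reg=0.05, epochs=50, seed=0):
    """Factorize a ratings matrix R (np.nan marks missing entries) as P @ Q.T."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(0.0, 0.1, (n_users, k))
    Q = rng.normal(0.0, 0.1, (n_items, k))
    users, items = np.where(~np.isnan(R))
    for _ in range(epochs):
        for u, i in zip(users, items):        # SGD over observed ratings only
            err = R[u, i] - P[u] @ Q[i]
            p_old = P[u].copy()
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return P, Q

def predict_new_user(profile_sims, R, item):
    """Score an item for a brand-new user from profile similarities to
    existing users (one weight per user), sidestepping the cold start."""
    rated = ~np.isnan(R[:, item])
    w = profile_sims[rated]
    return (w @ R[rated, item]) / (np.abs(w).sum() + 1e-12)
```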
Research on topic maps-based ontology information retrieval model
LI QingMao XingJiang Yang Xiang-Bing Zhou
Journal of Computer Applications    2010, 30 (1): 240-242.  
Ontology is normative, explicit, and reusable in defining domain concepts, so it can be combined with topic maps to organize information resources for semantic navigation. An information retrieval model based on topic maps and ontology was proposed and formally defined. Firstly, a domain of tourism documents was specified. Secondly, the ontology and topic maps of tourism documents were defined to normalize the natural-language queries that users input directly and to identify the users' real search intent, thereby expanding the users' semantic search. The effect of the ontology was then analyzed, showing valuable functions of semantic navigation and of sorting the retrieval results by correlation with the user's query. Finally, experimental results show that the topic maps-based ontology information retrieval model performs better than the traditional model.
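A toy sketch of the query normalization and expansion idea follows: a miniature hand-built tourism "ontology" maps surface terms in a natural-language query to canonical and related topics, widening retrieval. The ontology entries are invented purely for illustration.

```python
# hypothetical mini-ontology: surface term -> canonical topic plus related topics
TOURISM_ONTOLOGY = {
    "hotel": {"canonical": "accommodation", "related": ["hostel", "resort"]},
    "beach": {"canonical": "coastal attraction", "related": ["seaside", "bay"]},
}

def expand_query(query: str) -> set:
    """Normalize a free-text query and expand it with ontology topics."""
    terms = set(query.lower().split())
    expanded = set(terms)
    for t in terms:
        entry = TOURISM_ONTOLOGY.get(t)
        if entry:
            expanded.add(entry["canonical"])
            expanded.update(entry["related"])
    return expanded

print(expand_query("cheap hotel near beach"))
```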
Research of concept cluster based ontology mapping
Wen-tao LV, Yang XIANG, Bo ZHANG
Journal of Computer Applications   
Ontology heterogeneity is a major bottleneck for ontology application, and ontology mapping is the basis for integrating heterogeneous ontologies. Concept Cluster based Ontology Mapping (CCOM) uses the structural information of concepts in ontology mapping, replacing concept-to-concept similarity with concept-cluster similarity when reasoning over mapping rules. Experimental results show that CCOM achieves good recall and precision.
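The cluster-for-concept substitution can be sketched as below: rather than comparing two concepts directly, compare the clusters formed by each concept and its structural neighbours, aggregating pairwise similarities. The string-based similarity here is a stand-in for whatever measure CCOM actually uses.

```python
from difflib import SequenceMatcher

def concept_sim(a: str, b: str) -> float:
    """Stand-in lexical similarity between two concept names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def cluster_sim(cluster_a, cluster_b):
    """Average best-match similarity between two concept clusters
    (a concept plus its parents/children drawn from each ontology)."""
    scores = [max(concept_sim(a, b) for b in cluster_b) for a in cluster_a]
    return sum(scores) / len(scores)

print(cluster_sim(["car", "vehicle", "sedan"], ["automobile", "vehicle", "coupe"]))
```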
Online tracing Petri dish of large scale worm
Qiang LI, Jian KANG, Yang XIANG
Journal of Computer Applications   
Outbreaks of network worms cause tremendous damage to the Internet, and launching worm defense and response can improve the network's anti-strike capability. Tracing a worm's propagation path after an outbreak can reconstruct not only the earliest infected nodes but also the timing order in which victims were infected. For detecting and defending against large-scale Internet worm outbreaks, a convenient and safe experimental environment capable of running real worms is important for observing large-scale worm infection, intrusion, and propagation, and it can serve as a large-scale worm testbed for forensic evidence. This paper presents a large-scale worm propagation experiment environment for tracing algorithms: an isolated environment in which the related experiments can be run. To conform as closely as possible to a real network, the environment uses virtual machine technology to simulate a large number of hosts and network devices. Driven by actual worm samples, the environment can trigger large-scale worm outbreaks within a humanly controllable scope, observe the worm's propagation process, test detection and defense techniques, discover propagation characteristics such as scanning methods, collect network traffic and the propagation process in real time, investigate the traffic, and run inference algorithms that reconstruct the worm's patient zero and propagation path. The captured propagation process of the actual worm can then be compared with the results of the tracing algorithm.
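The kind of path inference such a testbed supports can be sketched as follows: given captured flow records (time, source, destination) on the worm's ports, attribute each victim's infection to its earliest inbound contact from an already-infected host. The real tracing algorithm is more sophisticated; the flow records below are fabricated for illustration.

```python
def reconstruct_tree(flows, patient_zero):
    """flows: iterable of (t, src, dst); returns host -> inferred infector."""
    infected = {patient_zero: None}
    for t, src, dst in sorted(flows):          # process contacts in time order
        if src in infected and dst not in infected:
            infected[dst] = src                # earliest plausible infector wins
    return infected

flows = [(1, "A", "B"), (2, "C", "D"), (3, "B", "C"), (4, "C", "D")]
print(reconstruct_tree(flows, "A"))   # {'A': None, 'B': 'A', 'C': 'B', 'D': 'C'}
```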